ColLex.en: Automatically Generating and Evaluating a Full-form Lexicon for English
Abstract
This paper describes a procedure for automatically generating a large full-form lexicon of English. We emphasize two statistical methods for lexicon extension and adjustment: a letter-based HMM and a detector of spelling variants and misspellings. The resulting resource, ColLex.EN, is evaluated on two tasks: text categorization and lexical coverage, the latter measured on the SUSANNE corpus and the Open ANC.
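As a rough illustration of what a lexical coverage evaluation measures, the following minimal sketch computes token-level and type-level coverage of a tokenized text against a full-form lexicon. It is not the authors' evaluation procedure: the function name `lexical_coverage`, the case-folding step, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def lexical_coverage(tokens, lexicon):
    """Return (token_coverage, type_coverage) of `tokens` against `lexicon`.

    Hypothetical helper: case-insensitive exact matching of word forms,
    which is only one of several reasonable ways to define coverage.
    """
    counts = Counter(t.lower() for t in tokens)
    known = {w.lower() for w in lexicon}
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        return 0.0, 0.0
    # Token coverage weights each word form by its frequency in the text.
    covered_tokens = sum(c for w, c in counts.items() if w in known)
    # Type coverage counts each distinct word form once.
    covered_types = sum(1 for w in counts if w in known)
    return covered_tokens / total_tokens, covered_types / len(counts)


# Toy data, invented for illustration only.
toy_lexicon = {"the", "cat", "cats", "sat", "sits"}
toy_tokens = ["The", "cat", "sat", "on", "the", "mat"]
token_cov, type_cov = lexical_coverage(toy_tokens, toy_lexicon)
print(f"token coverage: {token_cov:.2f}, type coverage: {type_cov:.2f}")
```

A full-form lexicon lists inflected forms directly (for example both "cat" and "cats"), so a lookup of this kind needs no lemmatization step.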
Similar resources
Automatic Generation of Multilingual Lexicon by Using Wordnet
A lexicon is the heart of any language processing system. Accurate words with grammatical and semantic attributes are essential or highly desirable for any application, be it machine translation, information extraction, various forms of tagging or text mining. However, good quality lexicons are difficult to construct requiring enormous amount of time and manpower. In this paper, we present a meth...
Developing and Evaluating a Searchable Swedish-Thai Lexicon
We present an automatically created Swedish-Thai lexicon. The lexicon was created by matching the English translations in a Thai-English and a Swedish-English lexicon. The search interface to the lexicon includes several NLP tools to help the target group: second language learners of Swedish. These include automatic generation of inflectional forms of words, automatic spelling correction, lemma...
Machine Translation without a Bilingual Dictionary
This paper outlines experiments conducted to determine the contribution of the traditional bilingual dictionary in the automatic alignment process to learn translation patterns, and at runtime. We found that by using automatically derived translation word pairs combined with a function word only lexicon, we were able to either match or nearly match the translation quality of the system that use...
Multilingual Aliasing for Auto-Generating Proposition Banks
Semantic Role Labeling (SRL) is the task of identifying the predicate-argument structure in sentences with semantic frame and role labels. For the English language, the Proposition Bank provides both a lexicon of all possible semantic frames and large amounts of labeled training data. In order to expand SRL beyond English, previous work investigated automatic approaches based on parallel corpor...
Automatic Lexicon Generation through WordNet
A lexicon is the heart of any language processing system. Accurate words with grammatical and semantic attributes are essential or highly desirable for any application – be it machine translation, information extraction, various forms of tagging or text mining. However, good quality lexicons are difficult to construct requiring enormous amount of time and manpower. In this paper, we present a m...